- Path: ix.netcom.com!news
- From: jlilley@ix.netcom.com (John Lilley)
- Newsgroups: comp.lang.c++
- Subject: Re: Special sort algorithm
- Date: 29 Mar 1996 16:05:48 GMT
- Organization: Netcom
- Message-ID: <4jh1os$scq@dfw-ixnews6.ix.netcom.com>
- References: <31593E09.29A0@kmd.dk>
- NNTP-Posting-Host: den-co11-03.ix.netcom.com
- Mime-Version: 1.0
- Content-Type: Text/Plain; charset=US-ASCII
- X-NETCOM-Date: Fri Mar 29 10:05:48 AM CST 1996
- X-Newsreader: WinVN 0.99.7
-
- >It must :
- > - Sort big files of short lines
- > lines are uneven length
- > files are XX MB
- > lines are < 100 chars
- > - Put resulting file in original file
- > - Run in very little memory ( < 50 K)
- > - Use about no extra disk space ( < 10% extra)
- > - Run reasonably fast
- >
- >I have tried implementing in-file bubblesort and others with the language
- >being C++. But they are way too slow, even with some optimizing buffers.
-
- Have you tried in-place QuickSort? It would only be NlogN instead
- of N*N. But assuming you've tried that... This will work with about
- 20% extra disk space, and should be faster than the in-place quicksort.
-
- Since you've got the disk space limitation, I would try the following:
-
- 1) Create an index file for the file you want to sort (e.g., a file
-    containing binary 4-byte integers, one per line, each set to the
-    offset within the source file of the beginning of that line).
-
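- The index-building pass can be sketched like so. This is an
- illustrative in-memory version (the real thing would write the
- 4-byte offsets out as the index file), and it assumes every line,
- including the last, ends in '\n':

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch of step 1: record the byte offset of the start of each line.
// In the real scheme these 4-byte integers would be written to an
// index file; here they are collected in a vector for illustration.
// Assumes every line, including the last, is terminated by '\n'.
std::vector<std::uint32_t> build_line_index(std::istream &src) {
    std::vector<std::uint32_t> index;
    std::string line;
    std::uint32_t offset = 0;
    while (std::getline(src, line)) {
        index.push_back(offset);
        offset += static_cast<std::uint32_t>(line.size()) + 1; // +1 for '\n'
    }
    return index;
}
```
-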
- 2) Sort the index file (i.e., treat the indices as pointers into the
-    source file and sort the indices such that the corresponding lines
-    are sorted), probably using a disk-based mergesort. For
-    files with an average line length > 40 characters, the index file
- will be 10% of the source file size. The intermediate files for
- the mergesort will take an additional 10%. It will still be
- slow because you will be performing NlogN random reads in the
- source on the second and subsequent passes of the mergesort,
- but that is hopefully better than NlogN reads AND writes.
- The mergesort itself requires NlogN *sequential* reads/writes
- of the index files, which is far better than random accesses.
-
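- The ordering rule for step 2 looks like this. The real version is
- a disk-based mergesort over the index file; std::stable_sort over an
- in-memory vector stands in here purely to show the comparison --
- indices are ordered by the lines they point at, not by their values:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the line that starts at byte `offset` in the source stream.
std::string line_at(std::istream &src, std::uint32_t offset) {
    src.clear();                 // clear any EOF state from a prior read
    src.seekg(offset);
    std::string line;
    std::getline(src, line);
    return line;
}

// Sketch of step 2's ordering rule: sort the offsets so that the lines
// they point at come out in sorted order. A disk-based mergesort would
// apply the same comparison while streaming runs of the index file.
void sort_index(std::vector<std::uint32_t> &index, std::istream &src) {
    std::stable_sort(index.begin(), index.end(),
                     [&src](std::uint32_t a, std::uint32_t b) {
                         return line_at(src, a) < line_at(src, b);
                     });
}
```
-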
- 3) Once the index file is sorted, create the sorted source file
- based on the index file. This will involve N random
- reads/writes in the source file, and another 10N sequential
- reads/writes (which is hopefully better than the NlogN random
- reads/writes needed by an in-place quicksort).
- You will also need a disk buffer, but since you already
-    allocated one for the mergesort intermediate files, you
-    might as well reuse that space. Walk through the
-    sorted index file, and
- for each line indexed by those integers, find that line in
- the source file, and write it over the beginning of the source
- file. As you do so, to avoid destroying lines that would be
- overwritten, write them to the disk buffer and keep track of
- M, where M is the number of lines that have been moved to
- the disk buffer. When the disk buffer reaches its 10% limit,
- compact the source file by removing all lines that have been
- moved to the beginning, adjust the indices, and then append
- the lines from the disk buffer that have not been written to
- the beginning of the source file, and adjust *those* indices
- as well. Keep going until all indices have been processed.
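-
- The core of step 3, stripped down, is just "walk the sorted index
- and emit each referenced line." The sketch below writes to a separate
- destination stream for clarity; the in-place scheme described above,
- with its 10% spill buffer and compaction, is considerably more
- involved but applies the same permutation:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Simplified sketch of step 3: walk the sorted index and emit each
// referenced line in order. This writes to a separate destination;
// the actual scheme overwrites the source file in place, spilling
// not-yet-moved lines to a bounded disk buffer.
void emit_sorted(std::istream &src, const std::vector<std::uint32_t> &index,
                 std::ostream &dst) {
    std::string line;
    for (std::uint32_t offset : index) {
        src.clear();             // clear any EOF state from a prior read
        src.seekg(offset);
        std::getline(src, line);
        dst << line << '\n';
    }
}
```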
-
- Disclaimer: This is ad-hoc. There is almost certainly a better
- way to do this described somewhere in the literature.
-
- john lilley
-